Abstract

This paper deals with nonparametric estimation of conditional densities
in mixture models in the case when additional covariates are available.
The proposed approach consists of performing a preliminary clustering
algorithm on the additional covariates to guess the mixture component of
each observation. Conditional densities of the mixture model are then
estimated using kernel density estimates applied separately to each
cluster. We investigate the expected $L_1$-error of the resulting
estimates and derive optimal rates of convergence over classical
nonparametric density classes provided the clustering method is
accurate. Performances of clustering algorithms are measured by the
maximal misclassification error. We obtain upper bounds of this quantity
for a single linkage hierarchical clustering algorithm. Lastly,
applications of the proposed method to mixture models involving
electricity distribution data and simulated data are presented.
1 Introduction


Finite mixture models are widely used to account for population
heterogeneities. In many fields such as biology, econometrics and social
sciences, experiments are based on the analysis of a variable
characterized by a different behavior depending on the group of
individuals. A natural way to model heterogeneity for a real random
variable $Y$ is to use a mixture model. In this case, the density $f$ of
$Y$ can be written as

$$ f(t) = \sum_{i=1}^{M} \alpha_i f_i(t), \qquad t \in \mathbb{R}. \tag{1.1} $$

Here $M$ is the number of subpopulations, and $\alpha_i$ and $f_i$ are
respectively the mixture proportion and the probability density function
of the $i$-th subpopulation. We refer the reader to Everitt and Hand
(1981), McLachlan and Basford (1988), McLachlan and Peel (2000) for a
broader picture of mixture density models as well as for practical
applications.

When dealing with mixture density models such as (1.1), some issues arise.
In some cases, the number of components $M$ is unknown and needs to be
estimated. To this end, some algorithms have been developed to provide
consistent estimates of this parameter. For instance, when $M$ corresponds
to the number of modes of $f$, Cuevas et al. (2000) and Biau et al. (2007)
propose an estimator based on the level sets of $f$. Model identifiability
is an additional issue that has received some attention in the literature.
Actually, model (1.1) is identifiable only by imposing restrictions on the
vector $(\alpha_1, \dots, \alpha_M, f_1, \dots, f_M)$. In order to provide
the minimal assumptions such that (1.1) becomes identifiable, Celeux and
Govaert (1995), Bordes et al. (2006) (see also the references therein)
assume that the density functions $f_i$ belong to some parametric or
semi-parametric density families. However, in a nonparametric setting, it
turns out that identifiability conditions are more difficult to provide.
Hall and Zhou (2003) define mild regularity conditions to achieve
identifiability in a multivariate nonparametric setting while Kitamura
(2004) considers the case where appropriate covariates are available.


When the model (1.1) is identifiable, the statistical problem consists of
estimating the mixture proportions $\alpha_i$ and the density functions
$f_i$. In the parametric case, some algorithms have been proposed such as
maximum likelihood techniques (Lindsay (1983a,b), Redner and Walker
(1984)) as well as Bayesian approaches (Diebolt and Robert (1994),
Biernacki et al. (2000)). When the $f_i$'s belong to nonparametric
families, it is often assumed that training data are observed, i.e., the
component of the mixture from which $Y$ is distributed is available. In
that case, the model is identifiable and some algorithms make it possible
to estimate both the $\alpha_i$'s and the $f_i$'s (see Titterington
(1983), Hall and Titterington (1984, 1985), Cerrito (1992)). However, as
pointed out by Hall and Zhou (2003), inference in nonparametric mixture
density models becomes more difficult without training data. These authors
introduce consistent nonparametric estimators of the conditional
distributions in a multivariate setting. We also refer to Bordes et al.
(2006) who provide efficient estimators under the assumption that the
unknown mixed distribution is symmetric. These estimates are extended by
Benaglia et al. (2009, 2011) for multivariate mixture models.

The framework we consider takes place between the two above situations. 
More precisely, training data are not observed but we assume to have at 
hand some covariates that may provide information on the components of the 
mixture from which Y is distributed. Our approach consists of performing 
a preliminary clustering algorithm on these covariates to guess the mixture 
component of each observation. Density functions fi are then estimated using 
a nonparametric density estimate based on the predictions of the clustering 
method. 

Many authors have already proposed to carry out a preliminary clustering
step to improve density estimates in mixture models. Ruzgas et al. (2006)
conduct a comprehensive simulation study to conclude that a preliminary
clustering using the EM algorithm makes it possible, to some extent, to
improve the performance of some density estimates (see also Jeon and
Landgrebe (1994)). However, to our knowledge, no work has been devoted so
far to measuring the effects of the clustering algorithm on the resulting
estimates of the distribution functions $f_i$. This paper proposes to fill
this gap by studying the $L_1$-error of these estimates. To do so, we
measure the performance of clustering methods by the maximal
misclassification error (2.3). This criterion allows us to derive optimal
rates of convergence over classical nonparametric density classes,
provided the clustering method used in the first step performs well with
respect to this notion.

The paper is organized as follows. In Section 2, we present the two-step 
estimator and give the main results. Examples of clustering algorithms are 
worked out in Section 3. In particular, the maximal misclassification error of
a hierarchical clustering algorithm is studied under mild assumptions on the 
model. Applications on simulated and real data are presented in Sections 4 
and 5. A short conclusion including a discussion of the implications of the 
work is given in Section 6 and proofs are gathered in Section 7. 
2 A two-step nonparametric estimator

2.1 The statistical problem 

Our focus is on the estimation of conditional densities in a univariate
mixture density model. Formally we let $(Y, I)$ be a random vector taking
values in $\mathbb{R} \times \{1, \dots, M\}$ where $M \geq 2$ is a known
integer. We assume that the distribution of $Y$ is characterized by a
density $f$ defined, for all $t \in \mathbb{R}$, by

$$ f(t) = \sum_{i=1}^{M} \alpha_i f_i(t), $$

where, for all $i \in \{1, \dots, M\}$, $\alpha_i = \mathbf{P}(I = i)$ are
the prior probabilities (or the weights of the mixture) and $f_i$ are the
densities of the conditional distributions $\mathcal{L}(Y \mid I = i)$ (or
the components of the mixture).

If we have at hand $n$ observations $(Y_1, I_1), \dots, (Y_n, I_n)$ drawn
from the distribution of $(Y, I)$, one can easily find efficient estimates
for both the $\alpha_i$'s and the $f_i$'s. For example, if we denote
$N_i = \#\{k \in \{1, \dots, n\} : I_k = i\}$, then we can estimate
$\alpha_i$ using the empirical proportion $\bar\alpha_i = N_i/n$ and $f_i$
by the kernel density estimate $\bar f_i$ defined for all
$t \in \mathbb{R}$ by

$$ \bar f_i(t) = \frac{1}{N_i} \sum_{k=1}^{n} K_h(t, Y_k)\, \mathbf{1}_{\{i\}}(I_k) \tag{2.1} $$

if $N_i > 0$. For the definiteness of $\bar f_i$ we conventionally set
$\bar f_i(t) = 0$ if $N_i = 0$. Here $K$ is a kernel which belongs to
$L_1(\mathbb{R}, \mathbb{R})$ and such that $\int K = 1$, $h > 0$ is a
bandwidth and

$$ K_h(t, y) = \frac{1}{h} K\left(\frac{t - y}{h}\right) \tag{2.2} $$

is the classical convolution kernel located at point $t$ (see Rosenblatt
(1956) and Parzen (1962) for instance). Estimate (2.1) is just the usual
kernel density estimate defined from the observations in the $i$-th
subpopulation. It follows that, under classical assumptions regarding the
smoothing parameter $h$ and the kernel $K$, $\bar f_i$ has properties
similar to those of the well-known kernel density estimate. In particular,
the expected $L_1$-error

$$ \mathbf{E}\|\bar f_i - f_i\|_1 = \mathbf{E} \int_{\mathbb{R}} |\bar f_i(t) - f_i(t)|\, \mathrm{d}t $$

achieves optimal rates when $f_i$ belongs to regular density classes such
as Hölder or Lipschitz classes (see Devroye and Györfi (1985)).
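
As an illustration of (2.1) and (2.2), the oracle estimates can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the Gaussian kernel and the names `gaussian_kernel`, `k_h` and `oracle_estimates` are choices made here for illustration, and the labels are assumed to be observed.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice; any K with integral 1 works)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def k_h(t, y, h, kernel=gaussian_kernel):
    """Convolution kernel K_h(t, y) = K((t - y) / h) / h, as in (2.2)."""
    return kernel((t - y) / h) / h


def oracle_estimates(ys, labels, i, h):
    """Return (alpha_bar_i, f_bar_i) computed from the observed labels, as in (2.1)."""
    n = len(ys)
    group = [y for y, lab in zip(ys, labels) if lab == i]
    n_i = len(group)
    alpha_bar = n_i / n

    def f_bar(t):
        # Convention of the text: the estimate is 0 when the group is empty.
        if n_i == 0:
            return 0.0
        return sum(k_h(t, y, h) for y in group) / n_i

    return alpha_bar, f_bar
```

For instance, `oracle_estimates(ys, labels, 1, h=0.5)` returns the empirical proportion of group 1 together with a function that evaluates the kernel estimate of its density at any point.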
The problem is more complicated when the random variable $I$ is not
observed. In this situation, $\bar\alpha_i$ and $\bar f_i$ are not
computable and one has to find another way to define efficient estimates
for both $\alpha_i$ and $f_i$. In this work, we assume that one can obtain
information on $I$ through another covariate $X$ which takes values in
$\mathbb{R}^d$ where $d \geq 1$. This random variable is observed and its
conditional distribution $\mathcal{L}(X \mid I = i)$ is characterized by a
density $g_i = g_{i,n} : \mathbb{R}^d \to \mathbb{R}$ which could depend on
$n$. In this framework, the statistical problem is to estimate both the
components and the weights of the mixture model (1.1) using the $n$-sample
$(Y_1, X_1), \dots, (Y_n, X_n)$ extracted from
$(Y_1, X_1, I_1), \dots, (Y_n, X_n, I_n)$ randomly drawn from the
distribution of $(Y, X, I)$.

2.2 Discussion on the model 

Estimating components of a mixture model is a classical statistical problem. 
The new feature proposed here is to include covariates in the model which can 
potentially improve traditional algorithms. These covariates are represented 
by a random vector X which provides information on the unobserved group I. 
This model includes many practical situations. Three examples are provided 
in this section. 

The classical mixture problem without covariates. A traditional
problem in mixture models is the estimation of the components $f_i$,
$i \in \{1, \dots, M\}$, in (1.1) from (only) an i.i.d. sample
$Y_1, \dots, Y_n$ drawn from $f$: no covariates are available. In this
context, many parametric methods such as the EM algorithm (and its
derivatives) as well as nonparametric procedures (under suitable
identifiability constraints) can be used and are widely studied. Even if
this model is formally a particular case of ours (we just have to take
$X = Y$), the approach presented in this paper is not designed to be
competitive in this situation with dedicated parametric or nonparametric
methods. Indeed, our model focuses on practical situations where
covariates can be used to obtain useful information about the hidden
variable $I$. Below, we offer two realistic situations where such
covariates are naturally available.

Medical example. Many diseases evolve over time and exhibit different
stages of development which can be represented by a variable $I$ that
takes a finite number of values. In many situations, the problem is not to
study the stage $I$ but some variables that can potentially behave
differently according to $I$. For instance, the survival time $Y$ and its
conditional distributions with respect to $I$ are typically of interest.
In practice, the stage $I$ is generally not observed. It is assessed by
the medical team from several items such as physiological data, medical
examinations, interviews with the patient (and so on) that can be
represented by the covariates $X$ in our model.

Electricity distribution. A distribution network may locally experience
minor problems, due for example to bad weather, that may affect some
customers during a fixed period of time in a given geographical area. To
better understand the origin and/or consequences of the dysfunctions, and
thus better forecast network operations, electricity distributors are
interested in the distribution of several quantities $Y$ for two different
groups of customers: those affected by the malfunction and the others. The
variables $Y$ may for instance represent averages or variations of
consumption after the disruption period. In this situation the group is
represented by a variable $I$: $I = 1$ for the users affected by the
disruption and $I = 2$ for the others. This binary variable $I$ is not
directly observed but it can be guessed from the individual consumption
curves during the disruption period. In our framework, discrete versions
of these curves correspond to the covariate $X$. This example is explained
in depth and analyzed in Section 5 using real data from ERDF, the main
French distributor of electricity.

2.3 A kernel density estimate based on a clustering 
approach 

To estimate the densities $f_i$ of the conditional distributions
$\mathcal{L}(Y \mid I = i)$, $i \in \{1, \dots, M\}$, we propose a two-step
algorithm that can be summarized as follows.

1. Apply a clustering algorithm on the sample $X_1, \dots, X_n$ to predict
the label $I_k$ of each observation.

2. Estimate the conditional densities $f_i$ by kernel density estimates
(2.1) where unobserved labels are substituted by predicted labels.

Formally, we first perform a given clustering algorithm to split the
sample $X_1, \dots, X_n$ into $M + 1$ clusters
$\mathcal{X}_0, \mathcal{X}_1, \dots, \mathcal{X}_M$ such that
$\mathcal{X}_i \neq \emptyset$ for all $i \in \{1, \dots, M\}$. The
clusters $\mathcal{X}_0, \mathcal{X}_1, \dots, \mathcal{X}_M$ satisfy

$$ \bigcup_{i=0}^{M} \mathcal{X}_i = \{X_1, \dots, X_n\} \quad \text{and} \quad \forall i \neq j,\ \mathcal{X}_i \cap \mathcal{X}_j = \emptyset. $$

We do not specify the clustering method here; some examples are discussed
in Sections 3 and 4. Observe that we define $M + 1$ clusters instead of
$M$. The cluster $\mathcal{X}_0$ (which could be empty) contains the
observations for which the clustering procedure is not able to predict the
label. For example, if the clustering procedure reveals some outliers,
they are collected in $\mathcal{X}_0$ and we do not use these outliers to
estimate the $f_i$'s.

Once the clustering step is performed, we define the predicted labels
$\hat I_k$ as

$$ \hat I_k = i \quad \text{if } X_k \in \mathcal{X}_i, \qquad k \in \{1, \dots, n\},\ i \in \{0, \dots, M\}. $$

Observation $X_k$ is not correctly assigned to its group with probability
$\mathbf{P}(\hat I_k \neq I_k)$. We measure the performance of the
clustering algorithm by the maximal probability of not correctly
attributing an observation:

$$ \varphi_n = \max_{1 \le k \le n} \mathbf{P}(\hat I_k \neq I_k). \tag{2.3} $$

We call this error term the maximal misclassification error. It will be
studied for two clustering algorithms in Section 3.

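To define our estimates, we just replace in (2.1) the true labels $I_k$ by
the predicted labels $\hat I_k$. Formally, the prior probabilities
$\alpha_i$ are estimated by

$$ \hat\alpha_i = \frac{\hat N_i}{n} \quad \text{where} \quad \hat N_i = \#\{k \in \{1, \dots, n\} : \hat I_k = i\}, $$

while for the conditional densities $f_i$, we consider the kernel density
estimator with kernel $K : \mathbb{R} \to \mathbb{R}$ and bandwidth $h > 0$

$$ \hat f_i(t) = \frac{1}{\hat N_i} \sum_{k : X_k \in \mathcal{X}_i} K_h(t, Y_k) = \frac{1}{\hat N_i} \sum_{k=1}^{n} K_h(t, Y_k)\, \mathbf{1}_{\{i\}}(\hat I_k), \tag{2.4} $$

where $K_h$ is defined in (2.2). Observe that since for all
$i \in \{1, \dots, M\}$ the clusters $\mathcal{X}_i$ are nonempty, the
estimates $\hat f_i$ are well defined.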
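
The two-step estimates above can be sketched as follows, once some clustering method has produced predicted labels. This is a hedged illustration rather than the authors' code: the Gaussian kernel and the names `k_h` and `two_step_estimates` are assumptions of this sketch, and the clustering step itself is left to the caller.

```python
import math


def k_h(t, y, h):
    """Gaussian convolution kernel K_h(t, y) (the Gaussian K is an assumed choice)."""
    u = (t - y) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))


def two_step_estimates(ys, predicted, M, h):
    """Map each i in 1..M to (alpha_hat_i, f_hat_i), built from predicted labels.

    Observations with predicted label 0 (the cluster of unclassified points)
    are not used in any density estimate, but still count in n.
    """
    n = len(ys)
    estimates = {}
    for i in range(1, M + 1):
        cluster = [y for y, lab in zip(ys, predicted) if lab == i]
        if not cluster:
            # The text assumes that clusters 1..M are nonempty.
            raise ValueError(f"cluster {i} is empty")
        alpha_hat = len(cluster) / n

        def f_hat(t, cluster=cluster):
            return sum(k_h(t, y, h) for y in cluster) / len(cluster)

        estimates[i] = (alpha_hat, f_hat)
    return estimates
```

Because label 0 is dropped from the density estimates but kept in the sample size, the estimated weights need not sum to one when some observations are left unclassified.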

The kernel estimates $\hat f_i$ are defined from the observations in the
cluster $\mathcal{X}_i$. The underlying assumption is that, for all
$i \in \{1, \dots, M\}$, each cluster $\mathcal{X}_i$ collects almost all
of the observations $X_k$ such that $Y_k$ is randomly drawn from $f_i$.
Under this assumption, $\varphi_n$ is expected to be small and $\hat f_i$
to be close to the oracle estimate $\bar f_i$ defined by equation (2.1).
This closeness is measured in the following theorem, which makes the
connection between the expected $L_1$-errors of $\hat f_i$ and $\bar f_i$.


Theorem 2.1 There exist positive constants $A_1$, $A_2$ and $A_3$ such
that, for all $n \geq 1$ and $i \in \{1, \dots, M\}$,

$$ \mathbf{E}\|\hat f_i - f_i\|_1 \le \mathbf{E}\|\bar f_i - f_i\|_1 + A_1 \varphi_n + A_2 \exp(-n) \tag{2.5} $$

and

$$ \mathbf{E}|\hat\alpha_i - \alpha_i| \le \varphi_n + \frac{A_3}{\sqrt{n}}. \tag{2.6} $$

The constants $A_1$, $A_2$ and $A_3$ are specified in the proof of the
theorem. We emphasize that inequalities (2.5) and (2.6) are
non-asymptotic, that is, the bounds are valid for all $n$. If we intend to
prove any consistency results regarding $\hat f_i$ and $\hat\alpha_i$,
inequality (2.5) says that the maximal misclassification error
$\varphi_n$ should tend to zero. Moreover, if $\varphi_n$ tends to zero
much faster than the $L_1$-error of $\bar f_i$, then the asymptotic
performance of $\hat f_i$ is guaranteed to be equivalent to that of the
oracle estimate $\bar f_i$. The $L_1$-error of $\bar f_i$, with properly
chosen bandwidth $h$ and kernel $K$, is known to go to zero, under
standard smoothness assumptions, at the rate $n^{-s/(2s+1)}$ where $s > 0$
is typically an index representing the regularity of $f_i$. For example,
when we consider Lipschitz or Hölder classes of functions with compact
supports, $s$ corresponds to the number of absolutely continuous
derivatives of the functions $f_i$. In this context, if
$\varphi_n = O(n^{-s/(2s+1)})$ then

$$ \mathbf{E}\|\hat f_i - f_i\|_1 = O\bigl(n^{-s/(2s+1)}\bigr). $$
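
The last claim follows from (2.5) by a one-line computation, spelled out here for convenience. The constants $C$ and $C'$ are names introduced only for this sketch: $C$ bounds the oracle rate and $C'$ the rate assumed for $\varphi_n$.

```latex
\begin{align*}
\mathbf{E}\|\hat f_i - f_i\|_1
  &\le \mathbf{E}\|\bar f_i - f_i\|_1 + A_1 \varphi_n + A_2 e^{-n} \\
  &\le C\, n^{-s/(2s+1)} + A_1 C'\, n^{-s/(2s+1)} + A_2 e^{-n} \\
  &= O\bigl(n^{-s/(2s+1)}\bigr),
\end{align*}
% the last step uses e^{-n} = o(n^{-s/(2s+1)}) as n tends to infinity.
```

In other words, the clustering step costs nothing asymptotically as soon as its maximal misclassification error is of the same order as the oracle rate.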


Remark 2.1 Note that even if the clusters
$\mathcal{X}_1, \dots, \mathcal{X}_M$ are arbitrarily indexed,
inequalities (2.5) and (2.6) are true whatever the choice of the indexes.
However, when indexes are not chosen according to the true labels,
$\varphi_n$ could be large even if the clustering procedure performs well.
In this situation there exists a permutation of the indexes such that,
after this permutation, the maximal misclassification error is small. More
precisely it can be readily seen, using Theorem 2.1, that

$$ \min_{\pi \in \Pi_M} \mathbf{E}\|\hat f_{\pi(i)} - f_i\|_1 \le \mathbf{E}\|\bar f_i - f_i\|_1 + A_1 \min_{\pi \in \Pi_M} \varphi_n(\pi) + A_2 \exp(-n) \tag{2.7} $$

where $\Pi_M$ denotes the set of all permutations of $\{1, \dots, M\}$ and
$\varphi_n(\pi)$ is the maximal misclassification error of the clustering
method after the permutation of the indexes:

$$ \varphi_n(\pi) = \max_{1 \le k \le n} \mathbf{P}(\pi(\hat I_k) \neq I_k), \qquad \pi \in \Pi_M. \tag{2.8} $$
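
In simulations, where the true labels are known, the effect of the permutation can be illustrated by searching over all relabelings. The sketch below is an assumption-laden illustration: it computes an empirical error rate on a single sample, which is only a proxy for the probability in (2.8), and the name `misclassification_up_to_permutation` is hypothetical. The exhaustive search is reasonable only for small $M$.

```python
from itertools import permutations


def misclassification_up_to_permutation(true_labels, predicted, M):
    """Return (best_rate, best_relabeling) over all permutations of 1..M.

    A predicted label 0 (unclassified observation) always counts as an error.
    """
    n = len(true_labels)
    best_rate, best_relabel = None, None
    for perm in permutations(range(1, M + 1)):
        # relabel[i] plays the role of pi(i) in (2.8).
        relabel = {i + 1: perm[i] for i in range(M)}
        errors = sum(
            1 for t, p in zip(true_labels, predicted) if relabel.get(p, 0) != t
        )
        rate = errors / n
        if best_rate is None or rate < best_rate:
            best_rate, best_relabel = rate, relabel
    return best_rate, best_relabel
```

With true labels `[1, 1, 2, 2, 2]` and predicted labels `[2, 2, 1, 1, 0]`, the clusters are perfectly recovered up to swapping the two indexes, and only the unclassified point contributes to the error.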


Remark 2.2 As usual, the choice of the bandwidth $h$ is crucial for the
performance of the kernel density estimates. However, this paper does not
provide any theory to select this parameter. If automatic or adaptive
procedures are needed, they can be obtained by adjusting traditional
automatic selection procedures for classical nonparametric estimators (see
for example Berlinet and Devroye (1994) or Devroye and Lugosi (2001)).
3 Clustering procedures

The proposed procedure requires a preliminary clustering algorithm
performed on the sample $X_1, \dots, X_n$. Even if any clustering
algorithm could be applied in practice, it should be chosen according to
the conditional distributions $\mathcal{L}(X \mid I = i)$,
$i \in \{1, \dots, M\}$. More precisely, each cluster should match up with
observations drawn from one of those conditional distributions. From a
theoretical point of view, for a given clustering procedure, the problem
is to find upper bounds for the maximal misclassification error
$\varphi_n$ in order to apply Theorem 2.1. In a parametric setting, i.e.,
when conditional distributions are identified by unknown parameters,
clustering algorithms are often based on efficient estimators of these
unknown parameters. We provide an example in Section 3.1. Without
parametric assumptions on the distribution, the problem is more
complicated. Contrary to data analysis methods such as regression or
classification, there are many ways to define clustering. One of the most
popular approaches consists of defining clusters as the connected
components of the level sets of the density (see Hartigan (1975)). This
amounts to saying that clusters represent high density regions of the data
separated by low density regions. In this context, many authors have
studied the theoretical performance of clustering algorithms based on
neighborhood graphs such as hierarchical or spectral clustering
algorithms. In Section 3.2, we extend results of Maier et al. (2009) and
Arias-Castro (2011) to our framework for a hierarchical clustering
algorithm based on pairwise distances. This procedure is compared with
other clustering methods in the simulation part.

3.1 A parametric example 

We consider a mixture of two uniform univariate densities

$$ g_{1,n}(x) = g_1(x) = \mathbf{1}_{[0,1]}(x) \quad \text{and} \quad g_{2,n}(x) = \mathbf{1}_{[1-\lambda_n,\, 2-\lambda_n]}(x), $$

where we recall that $g_{i,n}$ is the density of the conditional
distribution $\mathcal{L}(X \mid I = i)$, $i = 1, 2$. Here
$(\lambda_n)_n$ is a non-increasing sequence which tends to 0 as $n$ goes
to infinity. In this parametric situation, a natural way to guess the
unobserved label $I_k$ of the observation $X_k$ is to find an estimator
$\hat\lambda_n$ of $\lambda_n$ and to predict the labels (see Figure 1)
according to



(3.1) 
The accuracy of these predictions depends on the choice of the estimator
$\hat\lambda_n$. Here we choose $\hat\lambda_n = 2 - X_{(n)}$ where
$X_{(n)} = \max_{1 \le k \le n} X_k$. Note that in this situation, we have
for $i = 1, 2$

$$ \hat I_k = i \implies I_k = i \quad \text{a.s.} $$

It means that all classified observations (with non-zero estimated label)
are well-classified and that misclassified observations are collected in
$\mathcal{X}_0$ (see Figure 1).
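
A possible reading of this prediction rule is sketched below. The exact form of the rule is an assumption of this sketch, inferred from the surrounding description and from Figure 1: points above 1 receive label 2, points below $1 - \hat\lambda_n$ receive label 1, and the remaining points are left unclassified with label 0. The function name `predict_labels_uniform` is hypothetical.

```python
def predict_labels_uniform(xs):
    """Threshold rule for the two-uniform example (assumed form of the rule).

    lam_hat = 2 - max(xs) estimates the overlap length lambda_n from above,
    so every point that receives a non-zero label is correctly labeled.
    """
    lam_hat = 2.0 - max(xs)
    threshold = 1.0 - lam_hat
    labels = []
    for x in xs:
        if x > 1.0:
            labels.append(2)      # outside [0, 1]: can only come from group 2
        elif x < threshold:
            labels.append(1)      # below the estimated support of group 2
        else:
            labels.append(0)      # estimated overlap region: left unclassified
    return lam_hat, labels
```

On the sample `[0.1, 0.5, 0.85, 0.95, 1.2, 1.8]` the estimated overlap is about 0.2, so the two points in the interval from 0.8 to 1 are left unclassified.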


Figure 1: A sample of $n = 11$ points. Region labels shown in the figure: $I_k = 1$ or $2$; $\hat I_k = 1$; $\hat I_k = 0$; $\hat I_k = 2$.


The following proposition establishes a performance bound for the maximal
misclassification error $\varphi_n$ of this clustering procedure.

Proposition 3.1 There exists a positive constant $A_4$ such that for all
$n \geq 1$

$$ \varphi_n \le \lambda_n + A_4 \frac{\log n}{n}. $$


Unsurprisingly, $\varphi_n$ decreases as $\lambda_n$ decreases. Moreover,
since in most cases of interest the expected $L_1$-error of $\bar f_i$
tends to zero much slower than $1/\sqrt{n}$, this property means that,
asymptotically, the expected $L_1$-error of $\hat f_i$ is of the same
order as the expected $L_1$-error of $\bar f_i$ provided
$\lambda_n = O(1/\sqrt{n})$ (see (2.5)).


3.2 A hierarchical clustering algorithm 

Assuming that clusters are defined as connected components of level sets
of a density, many authors have studied theoretical properties of various
clustering algorithms. For instance, Maier et al. (2009) and Arias-Castro
(2011) prove that algorithms based on pairwise distances ($k$-nearest
neighbor graph, spectral clustering...) are efficient as soon as these
connected components are separated enough. In this section, we extend
results of these authors to bound the maximal misclassification error
$\varphi_n$ for a hierarchical clustering algorithm.
3.2.1 The clustering algorithm


Given $X_1, \dots, X_n$, we consider a single linkage hierarchical
clustering algorithm based on pairwise distances to extract exactly $M$
disjoint clusters $\mathcal{X}_1, \dots, \mathcal{X}_M$ from the
observations (see Arias-Castro (2011)). This algorithm consists of finding
a data-driven radius $\hat r_n > 0$ such that the set

$$ \bigcup_{k=1}^{n} B(X_k, \hat r_n) \tag{3.2} $$

has exactly $M$ connected components. Here $B(x, r)$ stands for the closed
Euclidean ball with center $x \in \mathbb{R}^d$ and radius $r > 0$. The
cluster $\mathcal{X}_i$ is then naturally composed of the observations
$X_k$ which belong to the $i$-th connected component of the set (3.2).

The radius $\hat r_n$ can be defined in a formal way to derive statistical
properties of the clustering procedure. To this end, we define for each
positive real number $r$ the $n \times n$ affinity matrix
$A^r = (A^r_{k,\ell})_{1 \le k, \ell \le n}$ by

$$ A^r_{k,\ell} = \begin{cases} 1 & \text{if } \|X_k - X_\ell\|_2 \le 2r, \text{ i.e. } B(X_k, r) \cap B(X_\ell, r) \neq \emptyset, \\ 0 & \text{otherwise,} \end{cases} \tag{3.3} $$

where $\|x\|_2$ stands for the Euclidean norm of $x \in \mathbb{R}^d$.
This matrix induces an undirected graph on the set $\{1, \dots, n\}$ and
two different observations $X_k$ and $X_\ell$ belong to the same cluster
if $k$ and $\ell$ belong to the same connected component of the graph. We
let $M_r$ be the number of connected components of the graph and we denote
by $\mathcal{X}_1(r), \dots, \mathcal{X}_{M_r}(r)$ the associated
clusters. The radius $\hat r_n$ is selected as follows

$$ \hat r_n = \inf\{r > 0 : M_r \le M\}. $$

Note that $\hat r_n$ is well-defined since the random set
$\mathcal{R}_M = \{r > 0 : M_r \le M\}$ is lower bounded (by 0) and
non-empty since $r^* = \max_{k \neq \ell} \|X_k - X_\ell\|_2$ always
belongs to this set ($M_{r^*} = 1$). Moreover, since $r \mapsto M_r$ is
non-increasing and right-continuous, one can easily prove that
$\hat r_n = \min \mathcal{R}_M$ and $M_{\hat r_n} = M$ almost surely when
$n > M$. Let $\mathcal{X}_1(\hat r_n), \dots, \mathcal{X}_M(\hat r_n)$ be
the $M$ clusters induced by $A^{\hat r_n}$; the aim is to study the
maximal misclassification error (2.3) of this clustering algorithm.

Remark 3.1 The algorithm requires that the connected components of the
graph induced by the $n \times n$ matrix $A^r$ be computed for different
values of $r$. Some algorithms can be performed to obtain these connected
components. For instance, we can use the depth-first search algorithm (see
Cormen et al. (1990)) which can be performed efficiently in
$O(V_n + E_n)$ operations, where $V_n$ and $E_n$ denote respectively the
number of vertices and edges of the graph.
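
The selection of the radius and of the clusters can be sketched as follows. Note that this sketch swaps in a different technique from the depth-first search mentioned above: it sorts the pairwise distances and merges components with a union-find structure, which reaches the same partition when no two pairwise distances are tied (an event of probability one for continuous data). The function name `single_linkage` is hypothetical and points are assumed to be given as tuples of coordinates.

```python
import math


def single_linkage(points, M):
    """Return (r_hat, labels): the smallest radius giving at most M components,
    and cluster labels in 1..M (arbitrarily indexed).

    Assumes no ties among pairwise distances; if n == M the radius is 0.0.
    """
    n = len(points)
    if n < M:
        raise ValueError("need at least M points")
    # Edge (k, l) appears in the graph of (3.3) as soon as dist <= 2 r.
    edges = sorted(
        (math.dist(points[k], points[l]), k, l)
        for k in range(n)
        for l in range(k + 1, n)
    )
    parent = list(range(n))

    def find(k):
        # Root of k, with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    components, r_hat = n, 0.0
    for d, k, l in edges:
        if components <= M:
            break
        root_k, root_l = find(k), find(l)
        if root_k != root_l:
            parent[root_k] = root_l
            components -= 1
            r_hat = d / 2.0  # two balls of radius d / 2 just touch
    roots, labels = {}, []
    for k in range(n):
        labels.append(roots.setdefault(find(k), len(roots) + 1))
    return r_hat, labels
```

Sorting all pairs costs on the order of $n^2 \log n$ operations, which is acceptable for moderate sample sizes; the returned labels are arbitrarily indexed, in line with the remark on permutations of the indexes.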
3.2.2 The clustering model

Recall that the clustering algorithm is performed on the sample
$X_1, \dots, X_n$. To study the maximal misclassification error, some
assumptions on the distribution of $X$ are needed.


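Assumption 1 Let $g_n$ denote the probability density of $X$. We assume
that there exists a positive sequence $(t_n)_n$ such that the set

$$ \{x \in \mathbb{R}^d : g_n(x) \ge t_n\} \tag{3.4} $$

has exactly $M$ disjoint connected compact components
$S_{1,n}, \dots, S_{M,n}$ satisfying, for all $i \in \{1, \dots, M\}$,

$$ \mathbf{P}(X_1 \in S_{i,n} \mid I_1 = i) = \int_{S_{i,n}} g_{i,n}(x)\, \mathrm{d}x > 1/2, \tag{3.5} $$

where we recall that $g_{i,n}$ stands for the density of the conditional
distribution $\mathcal{L}(X \mid I = i)$, $i \in \{1, \dots, M\}$. We note
$S_n = \bigcup_{i=1}^{M} S_{i,n}$ and

$$ \delta_n = \inf_{i \neq j} \operatorname{dist}(S_{i,n}, S_{j,n}), \quad \text{where} \quad \operatorname{dist}(S_{i,n}, S_{j,n}) = \inf_{x \in S_{i,n}} \inf_{y \in S_{j,n}} \|x - y\|_2. $$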

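Assumption 2 There exist two positive constants $c_1$ and $c_2$, and a
family of $N \in \mathbb{N}^*$ Euclidean balls $B_1, \dots, B_N$ with
radius $r_n/2$ such that

$$ \begin{cases} \mathrm{Leb}(S_n) \ge c_1 \sum_{\ell=1}^{N} \mathrm{Leb}(S_n \cap B_\ell), \\ \forall \ell = 1, \dots, N, \quad \mathrm{Leb}(S_n \cap B_\ell) \ge c_2 r_n^d, \end{cases} $$

where $\mathrm{Leb}$ denotes the Lebesgue measure on $\mathbb{R}^d$ and
$r_n$ is defined by

$$ r_n^d = \frac{\tau \log n}{n t_n} $$

with $\tau > 1/c_2$.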


Assumption 1 is classical when studying the performance of clustering
algorithms (see Maier et al. (2009)) or estimating the number of clusters
(see Biau et al. (2007)). It implies that clusters reflect high-density
regions separated by low-density regions. Condition (3.5) is required to
be sure that the connected components of (3.4) are correctly indexed. It
prevents most of the observations in $S_{i,n}$ from being drawn from
$g_{j,n}$ with $j \neq i$. Assumption 2 is more technical and pertains to
the diameter and regularity of the sets $S_{i,n}$. Our approach consists
of identifying the sets $S_{i,n}$ with the connected components of
$\bigcup_{k=1}^{n} B(X_k, r)$. Thus, when the diameter of $S_{i,n}$
increases, large values of the radius $r$ are necessary to connect the
observations in $S_{i,n}$. However, for too large values of $r$, the
number of connected components of $\bigcup_{k=1}^{n} B(X_k, r)$ becomes
smaller than $M$ and the method fails. Consequently, we need to constrain
the diameter of $S_{i,n}$. This is ensured by Assumption 2 since it
implies that $S_n$ can be covered by $N$ Euclidean balls such that

$$ N \le \frac{n}{c_1 c_2 \tau \log n}. \tag{3.6} $$

Finally, the inequality $\mathrm{Leb}(S_n \cap B_\ell) \ge c_2 r_n^d$ in
Assumption 2 can be seen as a smoothness assumption on the boundaries of
$S_n$ (see Biau et al. (2008)).

Remark 3.2 In dimension 1, since each $S_{i,n}$ is connected, it is a
segment of the real line. Thus, under Assumption 1, its diameter is
bounded by $1/t_n$ and Assumption 2 is satisfied. For higher dimensions,
things turn out to be more complicated. Indeed, even if the measure of the
compact set $S_n$ is upper bounded by $1/t_n$, its diameter can be as
large as we want. Consider for example the density


hn{x,y) = I[i_i/a„,a„](a:)I[ 0 ,i/x 2 ](l/), {x,y) e M+* X M+, 


where $a_n > 1$. Since $a_n$ could be chosen to be arbitrarily large, the
diameter of $S_n$ could also be arbitrarily large and Assumption 2 does
not hold. This assumption restricts to some extent the shape of $S_n$. It
is satisfied for regular sets such that the diameter does not increase too
quickly as $n$ goes to infinity. For example, consider the two-dimensional
situation where $S_n$ is a rectangle with length $U_n$ and width $V_n$. In
such a scenario, one can easily prove that if there exist two positive
constants $a_1$ and $a_2$ such that $U_n \ge a_1 V_n$ and
$V_n \ge a_2 r_n$, then Assumption 2 holds. Note also that this assumption
is verified for sets $S_n$ with smooth boundaries that do not depend on
the sample size $n$ (see Biau et al. (2007), Maier et al. (2009)).

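Remark 3.3 Assumption 1 is clearly satisfied when the supports of the
conditional densities $g_{i,n}$ are disjoint. This assumption can also be
verified when these supports overlap. As an example, consider the Laplace
mixture model:

$$ g_n(x) = \alpha_1 g_{1,n}(x) + \alpha_2 g_{2,n}(x), \qquad g_{i,n}(x) = \frac{1}{2\sigma_n} \exp\left(-\frac{|x - \mu_{i,n}|}{\sigma_n}\right), \quad i = 1, 2, $$

where $\sigma_n > 0$ and $\mu_{i,n} \in \mathbb{R}$ (see Figure 2). Let
$\xi_n = |\mu_{1,n} - \mu_{2,n}|$ be the distance between the two location
parameters $\mu_{1,n}$ and $\mu_{2,n}$ and define

$$ t_{*,n} = \frac{\sqrt{\alpha_1 \alpha_2}}{\sigma_n} \exp\left(-\frac{\xi_n}{2\sigma_n}\right) $$

and

$$ t_{i,n} = \frac{\alpha_i}{2\sigma_n} + \frac{1 - \alpha_i}{2\sigma_n} \exp\left(-\frac{\xi_n}{\sigma_n}\right), \qquad i = 1, 2. $$

Then direct calculations yield that for any
$t_n \in (t_{*,n}, \min(t_{1,n}, t_{2,n}))$ the level set
$\{\alpha_1 g_{1,n} + \alpha_2 g_{2,n} \ge t_n\}$ has exactly $M = 2$
connected components provided
$\log(\alpha_1/(1 - \alpha_1)) \in (-\xi_n/\sigma_n, \xi_n/\sigma_n)$.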



Figure 2: Connected components of level sets for a mixture of Laplace distributions.

3.2.3 The maximal misclassification error 

The algorithm described in Section 3.2.1 provides a partition of
$\{X_1, \dots, X_n\}$ into $M$ clusters
$\mathcal{X}_1(\hat r_n), \dots, \mathcal{X}_M(\hat r_n)$. To apply
Theorem 2.1, we have to find an upper bound of the maximal
misclassification error for the prediction rule

$$ \hat I_k = i \iff X_k \in \mathcal{X}_i(\hat r_n). $$

Observe that, for this clustering algorithm, the clusters
$\mathcal{X}_1(\hat r_n), \dots, \mathcal{X}_M(\hat r_n)$ defined in
Section 3.2.1 are arbitrarily indexed. Thus there is no guarantee that the
predicted labels are correctly indexed. To circumvent this problem, as
suggested in Remark 2.1, we study the maximal misclassification error up
to a permutation of the indexes.

The proposed clustering algorithm has been studied by Maier et al. (2009)
and Arias-Castro (2011). They prove that each cluster corresponds to one
of the connected components of (3.4) with high probability in a model
similar to ours. In other words, clusters make it possible to identify
each connected component of (3.4). Even if the identification of these
connected components is important in our setting, it is not sufficient
since our goal is to find an upper bound of the misclassification error
(2.8). Moreover, since the supports of the conditional densities $g_{i,n}$
can overlap, observations in the connected component $S_{i,n}$ of (3.4)
are not guaranteed to emerge from the distribution
$\mathcal{L}(X \mid I = i)$. This leads us to define

$$ \psi_n = \max_{1 \le i \le M} \mathbf{P}(X_1 \notin (S_{i,n} + r_n) \mid I_1 = i) $$

where for $S \subset \mathbb{R}^d$ and $r > 0$

$$ S + r = \{x \in \mathbb{R}^d : \exists y \in S \text{ such that } \|x - y\|_2 \le r\}. $$

Observe that $\psi_n$ is the maximal probability that an observation from
the $i$-th group does not belong to $S_{i,n} + r_n$. This parameter
reflects the degree of difficulty for the model to correctly predict the
label of the observations: the larger $\psi_n$, the more difficult it is.
We can now set forth the main result of this section.


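Theorem 3.1 Suppose that Assumption 1 and Assumption 2 hold. Moreover, if

$$ \delta_n > 2 \left(\frac{\tau \log n}{n t_n}\right)^{1/d} \tag{3.7} $$

then for all $0 < a < c_2 \tau - 1$, we have

$$ \min_{\pi \in \Pi_M} \max_{1 \le k \le n} \mathbf{P}(\pi(\hat I_k) \neq I_k) \le \frac{A_5}{n^{a} \log n} + (n + 2)\psi_n, \tag{3.8} $$

where $A_5$ is a positive constant.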


This theorem provides minimal assumptions to make accurate predictions of
the labels $I_k$. Inequality (3.7) gives the minimum distance between the
connected components $S_{i,n}$ needed to make the clustering method
efficient. When the supports of the conditional densities $g_{i,n}$ are
disjoint, it is easily seen that $\psi_n = 0$ and $\hat I_k = I_k$ almost
surely for $n$ large enough provided inequality (3.7) is satisfied. When
the supports overlap, inequality (3.8) ensures that the algorithm performs
well provided the probability $\psi_n$ tends to zero much faster than
$1/n$. In the Laplace example presented in Remark 3.3, it can be easily
seen that

It implies that as soon as ^n/o'n > 31og(?7,)/2, mjj{n) < and the kernel 

density estimates dehned in (2.4) satisfy 

mm E||/^(i) - fill < E\\fi - fi\\ + 
ttsHm \/n 


Finally, note that when $\psi_n = 0$, inequality (3.8) implies that each
cluster belongs to one of the connected components of (3.4) with high
probability. This result was obtained by Arias-Castro (2011) in a context
similar to ours under assumption (3.7). Theorem 3.1 extends this result to
$\psi_n > 0$. Note also that the proof of this theorem (see Section 7) is
different from that of Arias-Castro (2011) and relies on support density
estimation tools proposed by Biau et al. (2008).

4 Simulation study 

In this section, we provide simulation results illustrating the efficiency
of the proposed estimator. To this end, $Y$ is simulated from mixtures of
univariate Gaussian laws whereas several scenarios on the distribution of
$X$ are considered.

To illustrate Theorem 2.1 and Theorem 3.1, we compare the accuracy of
our two-step estimate $\hat f_i$ (see (2.4)) with the accuracy of the
oracle estimate $\bar f_i$ (see (2.1)). Such comparisons are made in both
Sections 4.1 and 4.2. However, each of these sections focuses on specific
points.

In Section 4.1, the two-step estimate is also compared with the classical
EM algorithm. Even if this algorithm is known to be efficient under the
parametric assumption made on the distribution of $Y$, it does not take
advantage of the presence of the covariates $X$. This allows our method to
outperform the EM algorithm in favorable situations.

In Section 4.2, different clustering procedures on $X$ are considered on several
classical data sets. In particular, the behavior of the spectral clustering and
the $k$-means algorithms is studied. Both of them are compared with the
hierarchical method studied in Section 3.2.

4.1 Comparison with the EM algorithm 

In this simulation section, the density of $Y$ is given by

$$f(t) = \frac{1}{2} f_1(t) + \frac{1}{2} f_2(t), \quad t \in \mathbb{R},$$

where $f_1$ and $f_2$ stand for the densities of the normal distributions with means
$-\Delta$ and $\Delta$ and variance 1. The parameter $\Delta$ measures the separation between
the components $f_1$ and $f_2$ (see Figure 3).

Figure 3: Density of $Y$ for various values of $\Delta$.

Two scenarios are considered for the distribution of $X$. In the first one, the
conditional densities $g_{i,n}$, $i = 1,2$, are uniform univariate densities:

$$g_{1,n}(x) = \mathbf{1}_{]0,1[}(x) \quad \text{and} \quad g_{2,n}(x) = \frac{1}{2}\,\mathbf{1}_{]1+\delta_n,\,3+\delta_n[}(x), \quad x \in \mathbb{R},$$

where $\delta_n > 0$ measures the distance between the supports of $g_{1,n}$ and $g_{2,n}$. For
the second one, we consider the mixture of Laplace distributions discussed
in Section 3.2.2: the conditional densities $g_{i,n}$, $i = 1,2$, are given by

$$g_{i,n}(x) = \frac{1}{2\sigma_n} \exp\left( -\frac{|x - \mu_{i,n}|}{\sigma_n} \right), \quad i = 1,2,$$

where $\sigma_n = \mu_{1,n} = 1$ and $\mu_{2,n} = \mu_{1,n} + \xi_n$ with $\xi_n > 0$. Observe that the
supports of the $g_{i,n}$ are disjoint in the uniform scenario while they overlap in
the Laplace example. The separation between these conditional distributions
is represented by the location parameters $\delta_n$ and $\xi_n$.
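
As an illustration, the two designs can be simulated along the following lines. This is a minimal sketch, not the authors' code: the function names, the argument `gap` (standing for the separation parameter of the chosen scenario) and the equal mixture proportions are assumptions made for the example.

```python
import random

def sample_laplace(rng, mu, sigma):
    """Laplace draw with location mu and scale sigma: the difference of two
    independent exponentials with mean sigma has density exp(-|x|/sigma)/(2 sigma)."""
    return mu + rng.expovariate(1.0 / sigma) - rng.expovariate(1.0 / sigma)

def simulate(n, delta, scenario, gap, seed=0):
    """Return a list of triples (label, x, y).

    delta : half-distance between the Gaussian means of Y (means -delta, +delta)
    gap   : separation parameter of the covariate distribution
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 2      # equal proportions (assumed)
        y = rng.gauss(-delta if label == 1 else delta, 1.0)
        if scenario == "uniform":
            # supports ]0,1[ and ]1+gap, 3+gap[
            x = rng.uniform(0.0, 1.0) if label == 1 else rng.uniform(1.0 + gap, 3.0 + gap)
        else:
            # Laplace scenario: scale 1, locations 1 and 1 + gap
            x = sample_laplace(rng, 1.0 if label == 1 else 1.0 + gap, 1.0)
        sample.append((label, x, y))
    return sample

uniform_sample = simulate(300, delta=1.0, scenario="uniform", gap=0.1)
laplace_sample = simulate(300, delta=1.0, scenario="laplace", gap=5.5)
```

In the uniform design the two covariate supports never intersect, whereas in the Laplace design a draw from one component can fall arbitrarily close to the other location, which is why the separation parameter has to be larger there.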

For the two proposed scenarios, the estimators $\hat f_1$ and $\hat f_2$ defined in (2.4) are
computed using the hierarchical clustering procedure proposed in Section 3.2.
These estimates are compared in terms of $L_1$-error with the oracle (but unobservable)
estimates $\bar f_1$ and $\bar f_2$ defined in (2.1). The nonparametric kernel estimates
$\hat f_i$ and $\bar f_i$ are computed with a Gaussian kernel. Recall that this paper
does not put forth any theory for selecting the bandwidth $h$ in an optimal way
(see Remark 2.2). Here we use the default data-driven procedure proposed
in the GNU-R library np (see Hayfield and Racine (2008)). In addition,
these nonparametric density estimates are compared with the EM algorithm
(Dempster et al. (1977)), which is known to perform well for estimating the parameters of
a Gaussian mixture model. Formally, we run this algorithm on the sample
$Y_1, \dots, Y_n$ to estimate the Gaussian parameters of $f_1$ and $f_2$. We use the GNU-R
library mclust and denote by $\hat f_1^{em}$ and $\hat f_2^{em}$ the resulting estimates. They
are used as a benchmark. We set $n = 300$ and, for the sake of clarity, we
present the results regarding $f_1$ only since the conclusions are the same for $f_2$.
Table 1 presents, for different values of $\Delta$, $\delta_n$ and $\xi_n$, the ratio

$$\eta(\tilde f_1) = \frac{\mathbb{E}\|\tilde f_1 - f_1\|_1}{\mathbb{E}\|\hat f_1^{em} - f_1\|_1}, \tag{4.1}$$

where $\tilde f_1$ is either $\hat f_1$ or $\bar f_1$. Expectations are evaluated over 500 Monte Carlo
replications.
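
The comparison between the two-step and the oracle estimates can be sketched for a single replication as follows. Several ingredients are assumptions of this sketch and not the procedure used in the paper: the bandwidth is a rule-of-thumb choice instead of the np selector, the clustering step simply cuts the sorted covariates at their largest gap (which is what single linkage with two clusters does in dimension one), the $L_1$-error is approximated by a Riemann sum, and the EM benchmark is not reproduced.

```python
import math
import random

def gaussian_kde(data, h):
    """Gaussian kernel density estimate built on `data` with bandwidth h."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return lambda t: norm * sum(math.exp(-0.5 * ((t - y) / h) ** 2) for y in data)

def rule_of_thumb_bandwidth(data):
    """Rule-of-thumb bandwidth; a stand-in for the np selector (assumption)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in data) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

def l1_error(estimate, target, lo=-8.0, hi=8.0, steps=400):
    """Midpoint Riemann approximation of the L1 distance on [lo, hi]."""
    dt = (hi - lo) / steps
    grid = (lo + (i + 0.5) * dt for i in range(steps))
    return dt * sum(abs(estimate(t) - target(t)) for t in grid)

def two_cluster_split(xs):
    """Cut the sorted covariates at their largest gap (two clusters, dimension one)."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    gaps = [xs[order[j + 1]] - xs[order[j]] for j in range(len(xs) - 1)]
    cut = max(range(len(gaps)), key=lambda j: gaps[j])
    labels = [2] * len(xs)
    for k in order[: cut + 1]:
        labels[k] = 1          # the leftmost cluster is called component 1
    return labels

rng = random.Random(0)
n, sep, gap = 300, 1.0, 1.0    # gap: distance between the two uniform supports
labels = [1 if rng.random() < 0.5 else 2 for _ in range(n)]
ys = [rng.gauss(-sep if i == 1 else sep, 1.0) for i in labels]
xs = [rng.uniform(0.0, 1.0) if i == 1 else rng.uniform(1.0 + gap, 3.0 + gap) for i in labels]

predicted = two_cluster_split(xs)

def f1(t):
    """Target density: normal with mean -sep and variance 1."""
    return math.exp(-0.5 * (t + sep) ** 2) / math.sqrt(2.0 * math.pi)

oracle_sample = [y for y, i in zip(ys, labels) if i == 1]
two_step_sample = [y for y, i in zip(ys, predicted) if i == 1]
err_oracle = l1_error(gaussian_kde(oracle_sample, rule_of_thumb_bandwidth(oracle_sample)), f1)
err_two_step = l1_error(gaussian_kde(two_step_sample, rule_of_thumb_bandwidth(two_step_sample)), f1)
ratio = err_two_step / err_oracle
```

When the clustering step recovers the labels exactly, the two samples coincide and so do the two errors; averaging both errors over many replications gives Monte Carlo approximations of the expectations that enter ratios such as (4.1).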



              Uniform: $\eta(\hat f_1)$ for $\delta_n$ = ...     Laplace: $\eta(\hat f_1)$ for $\xi_n$ = ...     Oracle
              0.03      0.05      0.1            4.5       5.5       6.5            $\eta(\bar f_1)$
$\Delta$ = 0.1      0.636     0.563     0.464          0.817     0.509     0.476          0.464
$\Delta$ = 0.5      1.156     0.923     0.679          1.261     0.749     0.692          0.679
$\Delta$ = 1        1.772     1.288     0.844          1.769     0.954     0.869          0.843
$\Delta$ = 2        4.243     2.876     1.702          4.298     2.093     1.830          1.701

Table 1: $L_1$-ratio (4.1) evaluated over 500 replications.


As expected, the performance of the EM algorithm clearly depends on the
separation distance between the target densities $f_1$ and $f_2$. For large values of
$\Delta$, the parametric estimates resulting from the EM algorithm outperform
the nonparametric estimates proposed in this paper (e.g. $\Delta = 2$ in Figure 4).
This is not the case when $f_1$ is close to $f_2$: the $L_1$-performance of $\hat f_1$ is
significantly better than that of $\hat f_1^{em}$ for $\Delta = 0.1$ and $\Delta = 0.5$, and roughly
similar for $\Delta = 1$. Note also that the $L_1$-error of $\hat f_1$ does not depend on $\Delta$ (see
Figure 4). Figure 5 displays scatterplots of the $L_1$-error of $\hat f_1$ versus that of the
oracle $\bar f_1$ for $\Delta = 1$. In line with Theorem 2.1, most points lie above
the diagonal. The distance from a point to the first bisector measures, to
some extent, the distance between $\hat f_1$ and $\bar f_1$ in terms of $L_1$-error: the closer
to the bisector, the better $\hat f_1$. In other words, this distance reflects the
performance of the clustering algorithm. We observe that the points move closer
to the first bisector as the separation parameters $\delta_n$ and $\xi_n$ increase. As explained
in Theorem 3.1, the performance of the hierarchical clustering algorithm depends
on the separation parameters $\delta_n$ and $\xi_n$: when these parameters increase,
the performance of $\hat f_1$ becomes similar to that of the oracle $\bar f_1$. Indeed, in our
simulations, we observe that the $L_1$-errors of $\hat f_1$ and $\bar f_1$ are nearly the same for
$\delta_n = 0.1$ (resp. $\xi_n = 6.5$) in the uniform case (resp. Laplace case).

Figure 4: Boxplot of the $L_1$-error for the estimate $\hat f_1^{em}$ (EM), the oracle estimate
$\bar f_1$ (OR) and the two-step estimate $\hat f_1$ (TS) for the Laplace example. The
separation distance $\Delta$ between $f_1$ and $f_2$ varies from 0.1 (left) to 2 (right) and
$\xi_n = 5.5$.

4.2 A comparison of clustering algorithms 

As discussed in Section 3, any clustering algorithm could be applied in practice.
However, it is clear that the $L_1$-performance of the proposed estimate
depends largely on the performance of the clustering method. The problem
is to find the appropriate clustering algorithm according to the covariates
$X$. In this section, we propose to compare three standard clustering procedures:
the hierarchical clustering algorithm presented in Section 3.2, the
spectral clustering algorithm performed with a Gaussian kernel (see
Arias-Castro (2011)) and the $k$-means algorithm.

Figure 5: $L_1$-error of $\bar f_1$ (x-axis) and $\hat f_1$ (y-axis) for the uniform (top) and Laplace
(bottom) examples.

The model is as follows. The density of $Y$ is now given by

$$f(t) = \frac{1}{2} f_1(t) + \frac{1}{2} f_2(t), \quad t \in \mathbb{R},$$

where $f_1$ and $f_2$ stand for the densities of the normal distributions with means
$-1$ and $1$ and variance 1. Here, the random variable $X$ takes values in $\mathbb{R}^2$ and
we again consider two scenarios for its distribution:

• “Circle-Square” model (see Baudry (2009)): $g_{1,n}$ is the density of the
Gaussian distribution with mean $(a, 0)$ and identity covariance
matrix; $g_{2,n}$ is the density of the uniform distribution over the square
$[-1,1]^2$ (see Figure 6).

• “Concentric circles” model (see Ng et al. (2002)): $g_{1,n}$ is the density of
the uniform distribution over $C(0, r_1 + \varepsilon, r_1 - \varepsilon)$ and $g_{2,n}$ is the density of the
uniform distribution over $C(0, r_2 + \varepsilon, r_2 - \varepsilon)$, where, for $r > 0$ and $\varepsilon > 0$,
$C(0, r + \varepsilon, r - \varepsilon)$ denotes the set between the circles with center 0 and
radii $r + \varepsilon$ and $r - \varepsilon$ (see Figure 7). We fix $r_1 = 0.3$, $\varepsilon = 0.15$ and
consider several values of $r_2$ such that $r_2 > r_1 + 2\varepsilon$.

The difficulty encountered in identifying each group depends on the parameters
$a$ and $r_2$: the smaller $a$ and $r_2$, the harder it is to identify the clusters.
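
The two bivariate designs can be simulated as in the sketch below. The function names and the equal mixture proportions are assumptions made for the illustration; the annulus sampler draws the squared radius uniformly so that the resulting points are uniform with respect to area.

```python
import math
import random

def sample_circle_square(n, a, rng):
    """Draw (label, point) pairs: Gaussian around (a, 0) or uniform on [-1, 1]^2."""
    data = []
    for _ in range(n):
        if rng.random() < 0.5:  # equal proportions: an assumption
            data.append((1, (rng.gauss(a, 1.0), rng.gauss(0.0, 1.0))))
        else:
            data.append((2, (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))))
    return data

def sample_annulus(r, eps, rng):
    """Uniform draw on the set between the circles of radii r - eps and r + eps."""
    # The area inside radius rho grows like rho^2, so rho^2 is drawn uniformly.
    rho = math.sqrt(rng.uniform((r - eps) ** 2, (r + eps) ** 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return (rho * math.cos(theta), rho * math.sin(theta))

def sample_concentric(n, r2, rng, r1=0.3, eps=0.15):
    """Draw (label, point) pairs from two concentric annuli of half-width eps."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 2  # equal proportions: an assumption
        data.append((label, sample_annulus(r1 if label == 1 else r2, eps, rng)))
    return data

rng = random.Random(0)
square = sample_circle_square(250, a=3.0, rng=rng)
circles = sample_concentric(250, r2=0.75, rng=rng)
```

In the second design both components share the same center, so a method that summarizes each cluster by a single central point cannot separate them, whereas a method based on pairwise distances can.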




Figure 6: A sample of $n = 250$ observations for the “Circle-Square” model with
$a = 3$ (left) and $a = 4$ (right).




Figure 7: A sample of $n = 250$ observations for the “Concentric circles” model
with $r_2 = 0.75$ (left) and $r_2 = 0.80$ (right).

For the two examples described above, we use the two-step kernel density estimator
with three clustering algorithms: hierarchical, spectral and $k$-means. The
resulting estimates are compared with the oracle estimates $\bar f_1$ and $\bar f_2$. We
keep the same setting as above to compute the estimates $\hat f_1$ and $\hat f_2$: a Gaussian
kernel and a bandwidth selected with the library np. For the sake of clarity,
we again present only the results for $f_1$ since the conclusions are the same for
$f_2$. Table 2 and Table 3 present the ratio

$$\eta(\hat f_1) = \frac{\mathbb{E}\|\hat f_1 - f_1\|_1}{\mathbb{E}\|\bar f_1 - f_1\|_1}, \tag{4.2}$$

for several values of $a$, $r_2$ and $n$. Expectations are evaluated over 500 Monte
Carlo replications and Figure 8 presents boxplots of the $L_1$-error of the different
estimates. For each replication, we also compute the error of the
clustering procedure

$$\frac{1}{n} \sum_{k=1}^{n} \mathbf{1}_{\{\hat I_k \neq I_k\}}$$

and we display in Table 2 and Table 3 this error term averaged over the 500
replications (it is denoted $\mathrm{err}_n$). Observe that this term is closely related to
the maximal misclassification error $\varphi_n$.
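
A clustering error of this kind can be computed as in the following sketch. Since cluster labels are only identified up to a relabeling, the function below takes the minimum over all permutations of the labels, mirroring the minimum over $\pi$ in (3.8); whether the averaged error reported in the tables includes this minimization is an assumption of the sketch, and the function name is hypothetical.

```python
from itertools import permutations

def clustering_error(predicted, truth, n_clusters):
    """Smallest proportion of mislabeled points over all relabelings of the clusters."""
    labels = list(range(1, n_clusters + 1))
    best = 1.0
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        mistakes = sum(relabel[p] != t for p, t in zip(predicted, truth))
        best = min(best, mistakes / len(truth))
    return best

# Swapped labels describe the same partition, hence a zero error.
swapped = clustering_error([1, 1, 2, 2], [2, 2, 1, 1], 2)
# One point out of four assigned to the wrong group.
one_mistake = clustering_error([1, 1, 1, 2], [1, 1, 2, 2], 2)
```

With two clusters this error cannot exceed 1/2, which gives a scale for reading the averaged errors: values close to 0.5 correspond to a partition that is nearly uninformative about the true labels.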



                      Hier.                    Spect.                   k-means
                      $\eta(\hat f_1)$   err_n        $\eta(\hat f_1)$   err_n        $\eta(\hat f_1)$   err_n
a = 3    n = 250      4.680     0.475        1.748     0.121        1.047     0.043
         n = 500      6.370     0.483        2.265     0.126        1.034     0.043
a = 4    n = 250      3.565     0.382        1.107     0.018        1.005     0.013
         n = 500      5.688     0.449        1.190     0.023        1.000     0.013
a = 5    n = 250      1.285     0.067        0.999     0.001        0.997     0.003
         n = 500      1.897     0.130        0.999     0.001        1.000     0.003

Table 2: Error ratio (4.2) evaluated over 500 Monte Carlo replications for the
“Circle-Square” example.



                          Hier.                    Spect.                   k-means
                          $\eta(\hat f_1)$   err_n        $\eta(\hat f_1)$   err_n        $\eta(\hat f_1)$   err_n
$r_2$ = 0.75   n = 250      4.040     0.349        2.776     0.195        4.568     0.468
             n = 500      1.197     0.021        1.013     0.001        5.993     0.478
$r_2$ = 0.80   n = 250      1.852     0.105        1.433     0.049        4.556     0.467
             n = 500      1.010     0.001        1.000     0.000        5.986     0.477

Table 3: Error ratio (4.2) evaluated over 500 Monte Carlo replications for the
“Concentric circles” example.


As shown in Theorem 2.1, the performance of $\hat f_1$ depends on the accuracy of the
clustering approach: the lower $\mathrm{err}_n$, the better $\hat f_1$. For the “Circle-Square”
dataset, the $k$-means algorithm unsurprisingly outperforms the two other clustering
methods. Indeed, $k$-means is well suited to this dataset since the
clusters can be identified by their distances to two particular points (the
centers of the uniform and Gaussian distributions). This is not the case for the
“Concentric circles” dataset, where the estimates defined from the hierarchical and
spectral clustering algorithms achieve the best estimated $L_1$-error.



Figure 8: Boxplot of the $L_1$-error for the oracle estimate $\bar f_1$ (OR) and the two-step
estimator $\hat f_1$ using the hierarchical algorithm (Hier), the spectral clustering algorithm
(Spect) and the $k$-means algorithm (KM). Results are for the “Circle-Square” dataset
with $a = 4$ and $n = 500$ (left) and the “Concentric circles” dataset with $r_2 = 0.75$
and $n = 500$ (right).


5 Application to electricity distribution 

5.1 Context of the study 

ERDF is the contract-holder of the public electricity distribution network
in France. ERDF is in charge of operating, maintaining and developing
the network. With 36,000 employees and 35 million customers served over
34,220 communes, ERDF is the largest electricity distributor in Europe. It
operates more than 1.3 million km of power lines and runs more than 11
million operations per year. ERDF also plays an essential role in ensuring
the proper functioning of the competitive electricity market by providing
a quality of electricity supply that is among the best in Europe, and by serving
all network users without discrimination.

In recent years, the electricity sector has entered a period of profound change
resulting from the emergence of decentralized and intermittent (wind, solar)
means of production and new electricity uses (e.g. electric vehicles). The
increasing integration of these new means of production and new uses has
a major impact on ERDF’s core business: connecting new users (producers,
electric vehicle charging terminals), and adapting rules of conduct and network
planning/investment to meet the new specifications. ERDF has initiated
its digital transformation plan so as to take advantage of new information
technologies and, by meeting these new challenges, offer better public service.

ERDF launched the “smart grid” experimental programs in order to run the
network with more flexibility and efficiency. To do so, these programs use
detailed network status and mine/produce information from different users.
These more detailed data (including data from a new generation of electricity
meters, called smart meters) will accordingly be used to improve network
monitoring (predictive maintenance).

In this section we focus on the detection of customers who experience a
significant decrease in consumption for a given period of time, i.e., a period
when an overall malfunction of the network could be observed. This will make
it possible to better understand the origin of dysfunctions and thus better
forecast network operation. For this study, we have the benefit of a set of
consumption curves for 226 customers with observations taken at regularly
spaced instants. Based on the observation of the individual consumption
curves, we can cluster individuals into two groups (those who have suffered an
abnormal decline and the others) and estimate, in each group, the distributions
of several variables using the approach proposed in this paper.

5.2 Application of the two-step estimator 

The consumption curves of $n = 226$ ERDF customers are observed at 9
regularly spaced instants $t_1, \dots, t_9$. The time interval $[t_1, t_9]$ covers a known
period of disruption between times $t_4$ and $t_6$. The observations consist of $n$
vectors $(Z_{k1}, \dots, Z_{k9}) \in \mathbb{R}^9$, where $Z_{kj}$ stands for the consumption of
user $k$ at time $t_j$.

Since ERDF is interested in comparing the behavior of customers of both
sub-populations (those who have suffered from the disruption and the others)
before and after the disruption period, we consider 6 different variables
related to the consumption around the disruption period. These variables,
presented below, are observed for each customer and thus are defined for any
$k \in [\![1,n]\!]$.

1. Average consumptions before, during and after the disruption period,
defined by:

$$Y_k^{(1)} = \frac{Z_{k1} + Z_{k2} + Z_{k3}}{3}, \quad Y_k^{(2)} = \frac{Z_{k4} + Z_{k5} + Z_{k6}}{3} \quad \text{and} \quad Y_k^{(3)} = \frac{Z_{k7} + Z_{k8} + Z_{k9}}{3}.$$

2. Evolutions of consumption around the disruption period, defined by:

$$Y_k^{(4)} = \frac{Y_k^{(2)} - Y_k^{(1)}}{Y_k^{(1)}}, \quad Y_k^{(5)} = \frac{Y_k^{(3)} - Y_k^{(1)}}{Y_k^{(1)}} \quad \text{and} \quad Y_k^{(6)} = \frac{Y_k^{(3)} - Y_k^{(2)}}{Y_k^{(2)}}.$$

Let $I$ be the random variable taking value 1 if a customer has been affected
by the disruption, and 2 otherwise. If we denote by $f_1^{(j)}$ and $f_2^{(j)}$ the conditional
densities of $(Y^{(j)} \mid I = 1)$ and $(Y^{(j)} \mid I = 2)$, the problem is to compare
$f_1^{(j)}$ with $f_2^{(j)}$ for each $j \in [\![1,6]\!]$. Even if ERDF can measure consumption
during the disruption period (between $t_4$ and $t_6$), it does not have the
capacity to identify the consumers affected by the perturbation. This means that
the random variables $I_k$, $k = 1, \dots, n$, are not observed. However, we know that
users impacted by the disruption posted a decline in consumption between $t_4$
and $t_6$. Figure 9 provides examples of customers potentially affected by the
disruption (for confidentiality reasons, representations are anonymous and
scales of power are not specified).






Figure 9: Consumptions of users suspected to be affected (up) or not (down) 
by the perturbation. 

Using the approach developed in this paper, we first have to identify the users
impacted by the disruption with a clustering algorithm. As the disruption
influences the consumption of user $k$ between $t_4$ and $t_6$, we define $X_k =
(X_{k1}, X_{k2})$, $k = 1, \dots, n$, with

$$X_{k1} = \min(V_{k,54}, V_{k,65}), \quad X_{k2} = V_{k,54} + V_{k,65},$$


where 







Observe that $V_{k,ij}$ measures the relative variation of consumption for user $k$
between $t_i$ and $t_j$. It follows that $X_k = (X_{k1}, X_{k2})$ captures the evolution
of the consumption of user $k$ during the disruption period. We use these covariates
to cluster users into two groups: the first contains the consumers assumed
to be affected by the disruption, the second contains the others.


Two clustering algorithms have been tested: the hierarchical method studied
in Section 3.2 and the $k$-means algorithm. Since these methods lead to
approximately the same clusters, we only present results for the hierarchical
method. Figures 10 and 11 present the kernel density estimates (2.4) of the
conditional densities $f_1^{(j)}$ and $f_2^{(j)}$ for $j \in [\![1,6]\!]$. The parameters (bandwidth and
kernel) of the kernel estimates are chosen as in the simulation part. For
confidentiality reasons, scales of power are again not specified.





Figure 10: Kernel estimates $\hat f_1^{(j)}$ (solid lines) and $\hat f_2^{(j)}$ (dashed lines) for $j = 1$
(left), 2 (center) and 3 (right).

Figure 11: Kernel estimates $\hat f_1^{(j)}$ (solid lines) and $\hat f_2^{(j)}$ (dashed lines) for $j = 4$
(left), 5 (center) and 6 (right).

Figure 10 strongly supports the idea that the clustering procedure correctly
identifies the users impacted by the disruption. Indeed, we observe that
the average consumption during the disruption period is lower for consumers
in the first group (second graph in Figure 10). We can also observe that
average consumptions are nearly the same for the two groups before and after
the disruption period. This means that users impacted by the perturbation
do not over-consume after the disruption period. This conclusion is also
supported by the second graph in Figure 11: the distributions representing the
evolution of consumption are similar for the two clusters.

6 Conclusion 

This paper provides a new framework to estimate conditional densities in
mixture models in the presence of covariates. To our knowledge, no clear
probabilistic model had previously been proposed to take the presence
of covariates into account. The model we consider includes such covariates, and Theorem
2.1 precisely describes the benefit of a preliminary clustering step on these
covariates for estimating the components of the mixture model. It is shown that
the performance of these estimates depends on the maximal misclassification
error (2.3) of the clustering algorithm. This criterion is a natural measure of
the performance of clustering algorithms but, as far as we know, it has not been
addressed before. We obtain non-asymptotic upper bounds of this error term
in Section 3.2 for a particular hierarchical algorithm. This algorithm is not
new but it has not been studied in this context. The results obtained for this
algorithm could be extended to other clustering algorithms based on pairwise
distances such as spectral clustering (Arias-Castro (2011)) or to clustering
methods based on neighborhood graphs (Maier et al. (2009)). Even if the main
contributions of this work are theoretical, both the simulation study and the
application to real data illustrate the efficiency of the proposed estimator in
the presence of covariates.

7 Proofs


7.1 Proof of Theorem 2.1 

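We first prove inequality (2.5). Since

$$\mathbb{E}\|\hat f_i - f_i\|_1 \le \mathbb{E}\|\bar f_i - f_i\|_1 + \mathbb{E}\|\hat f_i - \bar f_i\|_1,$$

we need only find an upper bound of the second term on the right-hand side
of the previous inequality. Since $\bar f_i = 0$ when $N_i = 0$ and $\|\hat f_i\|_1 = \|K\|_1$, we
have

$$\mathbb{E}\|\hat f_i - \bar f_i\|_1 \le \mathbb{E}\big(\|\hat f_i\|_1 \mathbf{1}_{\{N_i = 0\}}\big) + \mathbb{E}\|(\hat f_i - \bar f_i)\mathbf{1}_{\{N_i > 0\}}\|_1 \le \|K\|_1 (1 - \alpha_i)^n + \mathbb{E}\|(\hat f_i - \bar f_i)\mathbf{1}_{\{N_i > 0\}}\|_1.$$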

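For the sake of readability, let $\mathbb{E}^{I}$ denote the conditional expectation with
respect to $(I_1, \dots, I_n)$ and $\mathbb{E}^{I,X}$ the conditional expectation with respect to
$(I_1, \dots, I_n, X_1, \dots, X_n)$. Moreover, let

$$A_i(t) = \big(\hat f_i(t) - \bar f_i(t)\big)\mathbf{1}_{\{N_i > 0\}} = \sum_{k=1}^{n} K_h(t, Y_k) \left( \frac{\mathbf{1}_{\{i\}}(\hat I_k)}{\hat N_i} - \frac{\mathbf{1}_{\{i\}}(I_k)}{N_i} \right) \mathbf{1}_{\{N_i > 0\}}.$$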


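Using these notations it is easily seen that

$$\mathbb{E}\|(\hat f_i - \bar f_i)\mathbf{1}_{\{N_i > 0\}}\|_1 = \mathbb{E}\,\mathbb{E}^{I} \int_{\mathbb{R}} \mathbb{E}^{I,X}|A_i(t)|\,dt. \tag{7.1}$$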




































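Since, for all $y \in \mathbb{R}$, we have $\int_{\mathbb{R}} |K_h(t,y)|\,dt = \|K\|_1$, we deduce that

$$\int_{\mathbb{R}} \mathbb{E}^{I,X}|A_i(t)|\,dt \le \sum_{k=1}^{n} \mathbb{E}^{I,X}\!\left[\int_{\mathbb{R}} |K_h(t,Y_k)|\,dt\right] \left| \frac{\mathbf{1}_{\{i\}}(\hat I_k)}{\hat N_i} - \frac{\mathbf{1}_{\{i\}}(I_k)}{N_i} \right| \mathbf{1}_{\{N_i > 0\}} \le \|K\|_1 \sum_{k=1}^{n} \left| \frac{\mathbf{1}_{\{i\}}(\hat I_k)}{\hat N_i} - \frac{\mathbf{1}_{\{i\}}(I_k)}{N_i} \right| \mathbf{1}_{\{N_i > 0\}}.$$

Thus

$$\mathbb{E}^{I} \int_{\mathbb{R}} \mathbb{E}^{I,X}|A_i(t)|\,dt \le \frac{\|K\|_1 \mathbf{1}_{\{N_i > 0\}}}{N_i}\, \mathbb{E}^{I}\left[ \frac{1}{\hat N_i} \sum_{k=1}^{n} \big| N_i \mathbf{1}_{\{i\}}(\hat I_k) - \hat N_i \mathbf{1}_{\{i\}}(I_k) \big| \right]. \tag{7.2}$$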


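Moreover, inserting $\hat N_i \mathbf{1}_{\{i\}}(\hat I_k)$ in the previous expectation, we obtain

$$\mathbb{E}^{I}\left[ \frac{1}{\hat N_i} \sum_{k=1}^{n} \big| N_i \mathbf{1}_{\{i\}}(\hat I_k) - \hat N_i \mathbf{1}_{\{i\}}(I_k) \big| \right] \le \mathbb{E}^{I}|\hat N_i - N_i| + \mathbb{E}^{I} \sum_{k=1}^{n} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \le 2\,\mathbb{E}^{I} \sum_{k=1}^{n} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big|. \tag{7.3}$$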


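Combining (7.1), (7.2) and (7.3) leads to

$$\mathbb{E}\|(\hat f_i - \bar f_i)\mathbf{1}_{\{N_i > 0\}}\|_1 \le 2\|K\|_1\, \mathbb{E}\left[ \frac{\mathbf{1}_{\{N_i > 0\}}}{N_i}\, \mathbb{E}^{I} \sum_{k=1}^{n} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \right] \le \frac{2\|K\|_1}{n\alpha_i} \sum_{k=1}^{n} \mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \right]. \tag{7.4}$$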


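The expectation on the right-hand side of this inequality can be bounded in
the following way:

$$\mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \right] \le \mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \mathbf{1}_{\{n\alpha_i/N_i \le 2\}} \right] + \mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \mathbf{1}_{\{n\alpha_i/N_i > 2\}} \right]. \tag{7.5}$$

For the first term of this bound, we have

$$\mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \mathbf{1}_{\{n\alpha_i/N_i \le 2\}} \right] \le 2\,\mathbb{E}\big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big|, \tag{7.6}$$

while for the second term, we obtain from Hölder's inequality that

$$\mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \mathbf{1}_{\{n\alpha_i/N_i > 2\}} \right] \le \mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \mathbf{1}_{\{n\alpha_i/N_i > 2\}} \right] \le \sqrt{\mathbb{E}\left[ \frac{(n\alpha_i)^2}{N_i^2} \mathbf{1}_{\{N_i > 0\}} \right]}\, \sqrt{\mathbb{P}\left( N_i - n\alpha_i < -\frac{n\alpha_i}{2} \right)}. \tag{7.7}$$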

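Now, it can be easily seen that

$$\mathbb{E}\left[ \frac{(n\alpha_i)^2}{N_i^2} \mathbf{1}_{\{N_i > 0\}} \right] \le 6\, \mathbb{E}\left[ \frac{(n\alpha_i)^2}{(N_i + 1)(N_i + 2)} \right] \le 6, \tag{7.8}$$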


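where the last inequality follows from Hengartner and Matzner-Løber (2009).
Using Hoeffding’s inequality (see Hoeffding (1963)) we obtain for the second
term in (7.5)

$$\mathbb{E}\left[ \frac{n\alpha_i \mathbf{1}_{\{N_i > 0\}}}{N_i} \big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| \mathbf{1}_{\{n\alpha_i/N_i > 2\}} \right] \le \sqrt{6}\, \exp\left( -\frac{n\alpha_i^2}{4} \right). \tag{7.9}$$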
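
For the reader's convenience, the Hoeffding step behind (7.9) can be spelled out as follows. This is a sketch under the assumption, not restated in this section, that $N_i$ is the sum of the $n$ independent indicators $\mathbf{1}_{\{i\}}(I_k)$, each with mean $\alpha_i$, so that Hoeffding's inequality applies with deviation $t = n\alpha_i/2$.

```latex
\begin{align*}
\mathbb{P}\Big(N_i - n\alpha_i < -\frac{n\alpha_i}{2}\Big)
  &\le \exp\Big(-\frac{2\,(n\alpha_i/2)^2}{n}\Big)
   = \exp\Big(-\frac{n\alpha_i^2}{2}\Big), \\
\sqrt{\mathbb{E}\Big[\frac{(n\alpha_i)^2}{N_i^2}\,\mathbf{1}_{\{N_i>0\}}\Big]}\;
\sqrt{\mathbb{P}\Big(N_i - n\alpha_i < -\frac{n\alpha_i}{2}\Big)}
  &\le \sqrt{6}\,\exp\Big(-\frac{n\alpha_i^2}{4}\Big).
\end{align*}
```

The second line combines the first one with the bound by 6 on the second moment term, so the factor $\sqrt{6}$ and the exponent $n\alpha_i^2/4$ both come from taking square roots.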


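From (7.4)-(7.9), we deduce that

$$\mathbb{E}\|(\hat f_i - \bar f_i)\mathbf{1}_{\{N_i > 0\}}\|_1 \le \frac{4\|K\|_1}{\alpha_i}\,\varphi_n + \frac{2\sqrt{6}\,\|K\|_1}{\alpha_i} \exp\left( -\frac{n\alpha_i^2}{4} \right).$$

Putting all of the pieces together, we obtain

$$\mathbb{E}\|\hat f_i - f_i\|_1 \le \mathbb{E}\|\bar f_i - f_i\|_1 + \frac{4\|K\|_1}{\alpha_i}\,\varphi_n + \frac{2\sqrt{6}\,\|K\|_1}{\alpha_i} \exp\left( -\frac{n\alpha_i^2}{4} \right) + \|K\|_1 (1 - \alpha_i)^n,$$

which concludes the first part of the proof.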
Inequality (2.6) is proved as follows:

$$\mathbb{E}\left| \frac{\hat N_i}{n} - \alpha_i \right| \le \mathbb{E}\left| \frac{\hat N_i}{n} - \frac{N_i}{n} \right| + \mathbb{E}\left| \frac{N_i}{n} - \alpha_i \right| \le \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\big|\mathbf{1}_{\{i\}}(\hat I_k) - \mathbf{1}_{\{i\}}(I_k)\big| + \frac{1}{n}\sqrt{n\alpha_i} \le \varphi_n + \sqrt{\frac{\alpha_i}{n}}.$$

7.2 Proof of Proposition 3.1


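Let $k$ be an arbitrary integer in $[\![1,n]\!]$. We have to bound $\mathbb{P}(\hat I_k \neq i \mid I_k = i)$
for $i = 1, 2$. To do so, we first consider the case $i = 2$:

$$\mathbb{P}(\hat I_k \neq 2 \mid I_k = 2) = \mathbb{P}(\hat I_k \neq 2,\ 1 - \lambda_n < X_k \le 1 \mid I_k = 2) + \mathbb{P}(\hat I_k \neq 2,\ X_k > 1 \mid I_k = 2) = \mathbb{P}(1 - \lambda_n < X_k \le 1 \mid I_k = 2)$$

because, by definition, $\hat I_k \neq 2 \iff X_k \le 1$. Thus

$$\mathbb{P}(\hat I_k \neq 2 \mid I_k = 2) = \int_{1 - \lambda_n}^{1} g_{2,n}(x)\,dx = \lambda_n. \tag{7.10}$$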


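Next, if $i = 1$, it is easy to see that $\mathbb{P}(\hat I_k \neq 1 \mid I_k = 1) = \mathbb{P}(X_k > 1 - \hat\lambda_n \mid I_k = 1)$.
Let us consider

$$\mu_n = \lambda_n + \frac{2 \log n}{\alpha_2 n} \quad \text{and} \quad A_n = \{1 - \hat\lambda_n \ge 1 - \mu_n\}.$$

Using these notations we obtain

$$\{X_k > 1 - \hat\lambda_n\} = \big(\{X_k > 1 - \hat\lambda_n\} \cap A_n\big) \cup \big(\{X_k > 1 - \hat\lambda_n\} \cap A_n^c\big) \subset \{X_k > 1 - \mu_n\} \cup \{\hat\lambda_n > \mu_n\}.$$

This leads to the following inequality

$$\mathbb{P}(\hat I_k \neq 1 \mid I_k = 1) \le \mu_n + \mathbb{P}(X_{(n)} < 2 - \mu_n \mid I_k = 1). \tag{7.11}$$

Since $X_\ell$ and $I_k$ are independent for $\ell \neq k$, we obtain the following bound
for the last probability

$$\mathbb{P}(X_{(n)} < 2 - \mu_n \mid I_k = 1) = \mathbb{P}(\forall \ell \neq k,\ X_\ell < 2 - \mu_n,\ X_k < 2 - \mu_n \mid I_k = 1) = \prod_{\ell \neq k} \mathbb{P}(X_\ell < 2 - \mu_n)\; \mathbb{P}(X_k < 2 - \mu_n \mid I_k = 1).$$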

The independence of the $X_\ell$'s and simple calculations lead to

$$\mathbb{P}(X_{(n)} < 2 - \mu_n \mid I_k = 1) \le \{\mathbb{P}(X_1 < 2 - \mu_n)\}^{n-1} = \big(1 - 2 n^{-1} \log n\big)^{n-1} \le n^{-1}, \tag{7.12}$$

where the last inequality follows, for $n \ge 2$, from the fact that $1 - u \le e^{-u}$
for all $u \ge 0$. Taking together equations (7.11) and (7.12), we finally obtain

$$\mathbb{P}(\hat I_k \neq 1 \mid I_k = 1) \le \lambda_n + n^{-1} + \frac{2 \log n}{\alpha_2 n}. \tag{7.13}$$

The proposition follows from equations (7.10) and (7.13).
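
The last inequality in (7.12) can be checked in one line. The display below is a worked version of that step, using only $1 - u \le e^{-u}$ with $u = 2\log n / n$, which lies in $[0, 1)$ for every $n \ge 1$ so that both sides can be raised to the power $n - 1$.

```latex
\begin{align*}
\Big(1 - \frac{2\log n}{n}\Big)^{n-1}
  \le \exp\Big(-\frac{2(n-1)\log n}{n}\Big)
  = n^{-2(n-1)/n}
  \le n^{-1}.
\end{align*}
```

The final comparison holds because $2(n-1)/n \ge 1$ exactly when $n \ge 2$, which is where the restriction on $n$ comes from.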


7.3 Proof of Theorem 3.1 

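Since $\delta_n > 2 r_n$ we have, for all $(i,j) \in [\![1,M]\!]^2$ with $i \neq j$:

$$\Big( \bigcup_{k : X_k \in S_{i,n}} B(X_k, r_n) \Big) \cap \Big( \bigcup_{k : X_k \in S_{j,n}} B(X_k, r_n) \Big) \subset (S_{i,n} + r_n) \cap (S_{j,n} + r_n) = \emptyset, \tag{7.14}$$

where, for $S \subset \mathbb{R}^d$ and $r > 0$, we recall that

$$S + r = \{x : \exists y \in S \text{ such that } \|x - y\|_2 \le r\}.$$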

Inclusion (7.14) implies $M_{r_n} \ge M$. Moreover, observe that if

$$r_n \in \mathcal{R}_M = \{r > 0 : M_r \le M\} \tag{7.15}$$

then $M_{r_n} = M$ and the two affinity matrices defined in (3.3) induce
the same clusters $\mathcal{X}_1(r_n), \dots, \mathcal{X}_M(r_n)$. Furthermore, if (7.15) is verified, it is
easily seen that $\forall i \in [\![1,M]\!]$, $\exists j \in [\![1,M]\!]$ such that

$$\{X_k : X_k \in S_{i,n} + r_n\} \subset \mathcal{X}_j(r_n).$$

For simplicity, when (7.15) is satisfied, we index the clusters $\mathcal{X}_1(r_n), \dots, \mathcal{X}_M(r_n)$
such that

$$\{X_k : X_k \in S_{i,n} + r_n\} \subset \mathcal{X}_i(r_n), \quad i \in [\![1,M]\!].$$

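We deduce that

$$\begin{aligned}
\mathbb{P}(\hat I_k \neq I_k) &\le \mathbb{P}(\{\hat I_k \neq I_k\} \cap \{r_n \in \mathcal{R}_M\}) + \mathbb{P}(r_n \notin \mathcal{R}_M) \\
&\le \mathbb{P}(\{\hat I_k \neq I_k\} \cap \{r_n \in \mathcal{R}_M\} \cap \{X_k \in (S_n + r_n)\}) + \mathbb{P}(X_k \notin (S_n + r_n)) + \mathbb{P}(r_n \notin \mathcal{R}_M) \\
&\le \sum_{i=1}^{M} \mathbb{P}(X_k \notin (S_{i,n} + r_n) \mid I_k = i)\,\mathbb{P}(I_k = i) + \psi_n + \mathbb{P}(r_n \notin \mathcal{R}_M) \\
&\le 2\psi_n + \mathbb{P}(r_n \notin \mathcal{R}_M)
\end{aligned} \tag{7.16}$$

since $\mathbb{P}(X_k \notin (S_n + r_n)) \le \psi_n$. To complete the proof, we have to find an
upper bound for the probability of the event $\{r_n \notin \mathcal{R}_M\}$. Observe that

$$\mathbb{P}(r_n \notin \mathcal{R}_M) \le \mathbb{P}\Big( S_n \not\subset \bigcup_{k \in K_n} B(X_k, r_n) \Big) + \mathbb{P}\Big( \{r_n \notin \mathcal{R}_M\} \cap \Big\{ S_n \subset \bigcup_{k \in K_n} B(X_k, r_n) \Big\} \Big) \tag{7.17}$$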

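where $K_n = \{k \in [\![1,n]\!] : X_k \in S_n\}$. For the first term on the right-hand
side of the above equation, remark that the inclusion

$$S_n \subset \bigcup_{k \in K_n} B(X_k, r_n)$$

holds when, for all $\ell \in [\![1,N]\!]$, the balls $B_\ell$ defined in assumption 2 contain
at least one observation among $\{X_k, k \in K_n\}$. Thus

$$\begin{aligned}
\mathbb{P}\Big( S_n \not\subset \bigcup_{k \in K_n} B(X_k, r_n) \Big) &\le \mathbb{P}(\exists \ell \in [\![1,N]\!],\ \forall k \in K_n,\ X_k \notin B_\ell) \\
&\le \sum_{\ell=1}^{N} \mathbb{P}(\forall k \in K_n,\ X_k \notin B_\ell) \\
&\le \sum_{\ell=1}^{N} \mathbb{P}\Big( \bigcap_{k=1}^{n} \big( \{X_k \in S_n\} \cap \{X_k \notin B_\ell\} \big) \cup \{X_k \notin S_n\} \Big) \\
&\le \sum_{\ell=1}^{N} \big( \mathbb{P}(\{X_k \in S_n\} \cap \{X_k \notin B_\ell\}) + \mathbb{P}(X_k \notin S_n) \big)^n \\
&\le \sum_{\ell=1}^{N} \big( 1 - \mathbb{P}(X_k \in (B_\ell \cap S_n)) - \mathbb{P}(X_k \notin S_n) + \mathbb{P}(X_k \notin S_n) \big)^n \\
&\le \sum_{\ell=1}^{N} \big( 1 - \mathbb{P}(X_k \in (B_\ell \cap S_n)) \big)^n.
\end{aligned}$$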

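According to assumption 2 and inequality (3.6), we obtain

$$\begin{aligned}
\mathbb{P}\Big( S_n \not\subset \bigcup_{k \in K_n} B(X_k, r_n) \Big) &\le \sum_{\ell=1}^{N} (1 - t_n c_2 r_n^d)^n \le N (1 - c_2 t_n r_n^d)^n \\
&\le (\tau c_1 c_2)^{-1} \frac{n}{\log n} \exp(-c_2 n t_n r_n^d) \\
&\le (\tau c_1 c_2)^{-1} \frac{n}{\log n} \exp(-c_2 \tau \log n).
\end{aligned}$$

Since $c_2 \tau > 1 + a$ we have

$$\mathbb{P}\Big( S_n \not\subset \bigcup_{k \in K_n} B(X_k, r_n) \Big) \le (\tau c_1 c_2)^{-1} \frac{1}{n^{a} \log n}. \tag{7.18}$$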
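
The passage from the exponential bound to (7.18) is a direct computation, written out below for $n \ge 2$ (so that $\log n > 0$). It only uses $\exp(-c_2 \tau \log n) = n^{-c_2 \tau}$ and the condition $c_2 \tau > 1 + a$.

```latex
\begin{align*}
\frac{n}{\log n}\,\exp(-c_2 \tau \log n)
  = \frac{n^{1 - c_2 \tau}}{\log n}
  \le \frac{n^{-a}}{\log n}
  = \frac{1}{n^{a} \log n}.
\end{align*}
```

This is where the admissible range of $a$ in the theorem comes from: the exponent $1 - c_2\tau$ must be at most $-a$.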


For the second term on the right-hand side of (7.17), we have

$$\mathbb{P}\Big( \{r_n \notin \mathcal{R}_M\} \cap \Big\{ S_n \subset \bigcup_{k \in K_n} B(X_k, r_n) \Big\} \Big) \le \mathbb{P}(\exists k \in [\![1,n]\!] : X_k \notin (S_n + r_n)) \le n \psi_n. \tag{7.19}$$

Taking (7.16), (7.18) and (7.19) together, the result follows.